\subsection{Trees}
Recall that the language \( \qty{0^k1^k \mid k > 0} \) is context-free but not regular, so context-free languages are indeed a proper superset of regular languages.
The structure of regular derivations was very simple; each intermediate step was of the form \( wA \) for a word \( w \) and a variable \( A \in V \).
However, the structure of context-free derivations is more complicated: we use a parse tree instead of a linear derivation.
\begin{definition}
	A set \( T \subseteq \mathbb N^\star \) is called a \emph{(finitely-branching) tree} if it is closed under initial segments, and for every \( t \in T \), there is a \emph{branching number} \( n \in \mathbb N \) such that for all \( k \), the sequence \( tk \) lies in \( T \) if and only if \( k < n \).
	A node \( t \in T \) with no sucessors is called a \emph{leaf}.
	The empty sequence, which is an element of every tree, is called the \emph{root}.
	A node \( t \in T \) has \emph{level} \( k \) if the length of the sequence is \( k \), so \( \abs{t} = k \).
	If \( T \) is finite, there is a maximum level, called the \emph{height} of the tree.
	For a node \( t \in T \), the sequence \( \eval{t}_0, \eval{t}_1, \dots, \eval{t}_{\abs{t}} = t \) is called the \emph{branch} leading to \( t \).
\end{definition}
\begin{example}
	This is an example of a tree.
	\begin{center}
		\begin{tikzcd}
			&&& \varepsilon \\
			&& 0 & 1 & 2 \\
			00 & 01 && 10 & 20 & 21 & 22 \\
			000 & 001 & 100 & 101 && 210 & 220 & 221 \\
			& 0010
			\arrow[from=1-4, to=2-3]
			\arrow[from=1-4, to=2-4]
			\arrow[from=1-4, to=2-5]
			\arrow[from=2-4, to=3-4]
			\arrow[from=3-4, to=4-3]
			\arrow[from=3-4, to=4-4]
			\arrow[from=2-3, to=3-1]
			\arrow[from=2-3, to=3-2]
			\arrow[from=3-1, to=4-1]
			\arrow[from=3-2, to=4-2]
			\arrow[from=4-2, to=5-2]
			\arrow[from=2-5, to=3-5]
			\arrow[from=2-5, to=3-6]
			\arrow[from=3-6, to=4-6]
			\arrow[from=2-5, to=3-7]
			\arrow[from=3-7, to=4-7]
			\arrow[from=3-7, to=4-8]
		\end{tikzcd}
	\end{center}
\end{example}
\begin{definition}
	Let \( T \) be a tree and \( t \in T \).
	Then \( T_t = \qty{s \mid ts \in T} \) is the \emph{subtree starting from \( t \)}.
\end{definition}
\begin{definition}
	We define a partial order on \( T \) by \( t < s \) if \( t \neq s \) and if there exists \( k \) such that \( t(k) \neq s(k) \) and \( k_0 \) is minimal with this property, then \( t(k_0) < s(k_0) \).
	This is called the \emph{left-to-right order}.
\end{definition}
\begin{remark}
	This order is only a partial order since it does not order two distinct nodes that lie on the same branch, for example, \( 0 \) and \( 00 \).
	For each level \( k \), the nodes of length \( k \) are totally ordered.
	The leaves are totally ordered.
\end{remark}

\subsection{Parse trees}
\begin{definition}
	Let \( G \) be a context-free grammar.
	A pair \( \mathbb T = (T, \ell) \) is a \emph{\( G \)-parse tree} if \( T \) is a finitely-branching tree and \( \ell \colon T \to \Omega \) is a \emph{labelling function} such that:
	\begin{enumerate}
		\item \( \ell(\varepsilon) \in V \), we say \( T \) \emph{starts with} \( \ell(\varepsilon) \);
		\item if \( \ell(t) \in \Sigma \), \( t \) has no successors;
		\item if \( t \) has \( n + 1 \) successors and \( \ell(t) = A \in V \), then \( A \to \ell(t0) \ell(t1) \dots \ell(tn) \in P \).
	\end{enumerate}
	If \( \mathbb T = (T,\ell) \) is a \( G \)-parse tree, and \( t_0, \dots, t_m \) are its leaves written in the left-to-right order, then the \emph{string parsed by \( \mathbb T \)} is \( \sigma_{\mathbb T} = \ell(t_0) \dots \ell(t_m) \).
\end{definition}
\begin{remark}
	If \( t \in T \), \( \sigma_{\mathbb T} = \alpha \sigma_{\mathbb T_t} \beta \) where \( \mathbb T_t = (T_t, \ell_t) \), \( \ell_t(s) = \ell(ts) \).
\end{remark}
\begin{proposition}
	Let \( G \) be a context-free grammar.
	Then \( w \in \mathcal L(G) \) if and only if there is a \( G \)-parse tree \( \mathbb T \) starting from \( S \) such that \( \sigma_{\mathbb T} = w \).
\end{proposition}
\begin{proof}
	Observe that certain sequences of parse trees correspond to derivations.
	In particular, a sequence \( \mathbb T_0, \dots, \mathbb T_n \) of \( G \)-parse trees is \emph{derivative} if \( \mathbb T_0 = (\qty{\varepsilon}, \ell_0) \) with \( \ell_0(\varepsilon) = S \), and \( T_{i+1} \supseteq T_i \) is constructed by considering a leaf \( t \in T_i \) such that \( \ell_i(t) = A \in V \) and \( A \to x_0 \dots x_n \in P \), and giving it \( n + 1 \) successors with \( \ell_{i+1}(tk) = x_k \).
	There is a one-to-one correspondence between \( G \)-derivations starting from \( S \) and such derivative sequences of parse trees.
	In particular, any derivation yields a derivative sequence of parse trees, and hence the last parse tree in the sequence has \( \sigma_{\mathbb T_n} = w \).

	Conversely, given a parse tree \( \mathbb T \), it suffices to construct such a derivative sequence of parse trees, because then the correspondence yields a derivation as required.
	We start with the trivial tree \( \mathbb T_0 = \qty(\qty{\varepsilon}, \eval{\ell}_{\qty{\varepsilon}}) \).
	In each step, suppose \( \mathbb T_0, \dots, \mathbb T_i \) already form a derivative sequence, and \( T_i \neq T \).
	Let \( t \in T \setminus T_i \).
	Then there is a terminal node in \( T_i \) on the branch containing \( t \) in \( T \), which is not a terminal node in \( T \).
	We can then create \( T_{i+1} \) by adding the \( T \)-successors of \( t \) to \( T_i \).
	Since \( T \) is finite, after finitely many steps we are done.
	In particular, \( \mathbb T_0, \dots, \mathbb T_n \) is a derivative sequence, and thus \( S \xrightarrow G \sigma_{\mathbb T_n} = \sigma_{\mathbb T} = w \) as required.
\end{proof}
Suppose \( \mathbb T \) is a parse tree and \( t \in T \) such that \( \ell(t) = A \), and \( \mathbb T' \) is a parse tree starting from \( A \).
Then, we can remove the subtree \( T_t \), and replace it with \( T' \), which also yields a parse tree.
This technique is known as \emph{grafting}.
\begin{definition}
	We define \( \graft(\mathbb T, t, \mathbb T') = (S, \ell^\star) \) where
	\[ S = \qty{s \in T \mid t \not\subseteq s} \cup \qty{tu \mid u \in T'} \]
	and
	\[ \ell^\star(s) = \begin{cases}
		\ell(s) & t \not\subseteq s \\
		\ell'(u) & \exists u \in T',\, s = tu
	\end{cases} \]
\end{definition}
Then we have
\[ \sigma_{\graft(\mathbb T, t, \mathbb T')} = \alpha \sigma_{\mathbb T'} \beta;\quad \sigma_{\mathbb T} = \alpha \sigma_{\mathbb T_t} \beta \]

\subsection{Chomsky normal form}
\begin{definition}
	A grammar is in \emph{Chomsky normal form} if all of its rules are of the form \( A \to BC \) or \( A \to a \).
	Rules of the form \( A \to BC \) are called \emph{binary}; rules of the form \( A \to a \) are called \emph{unary}.
\end{definition}
Every grammar in Chomsky normal form is context-free.
\begin{lemma}
	Let \( G \) be a grammar in Chomsky normal form, and \( w \in \mathcal L(G) \) with \( \abs{w} = n \).
	Then every \( G \)-derivation of \( w \) has length \( 2n - 1 \).
\end{lemma}
\begin{proof}
	Binary rules increment the length, and increment the variable count.
	Unary rules preserve the length, and decrement the variable count.
	Since \( w \) is comprised only of letters, exactly \( n - 1 \) binary rules and \( n \) unary rules were used.
\end{proof}
We will show that every context-free grammar is equivalent to a Chomsky normal form grammar, and there is an algorithm to produce such a grammar.
There are three types of rules that are obstructions to a context-free grammar being in Chomsky normal form:
\begin{enumerate}
	\item rules \( A \to \alpha \) where \( \abs{\alpha} \geq 2 \) and \( \alpha \) contains a letter;
	\item rules of the form \( A \to B \), called \emph{unit rules}.
	\item rules \( A \to \alpha \) where \( \abs{\alpha} > 2 \) and \( \alpha \) contains only variables.
\end{enumerate}
Suppose we have a rule of the form \( A \to \alpha \) where \( \abs{\alpha} \geq 2 \), and \( \alpha \) contains a letter.
For each letter \( a \in \Sigma \), we can add a variable \( X_a \) and a rule \( X_a \mapsto a \).
Then we convert \( \alpha \) to \( X(\alpha) \), where \( X \) is the map converting each \( a \) into \( X_a \).
Then \( \alpha \) contains no letter.
We can therefore suppose without loss of generality that a given context-free grammar has no rules of this form.

Now consider a unit rule \( A \to B \).
A grammar is called \emph{unit closed} if for all \( A \to B \in P \) and \( B \to \alpha \in P \), we also have \( A \to \alpha \in P \).
We can easily convert each grammar into an equivalent unit closed grammar by adding at most \( \abs{V} \cdot \abs{P} \) new rules.
If a context-free grammar \( G \) is unit closed, we will show that we can remove all unit rules to give a grammar \( G' \) without changing the language.
Clearly \( \mathcal L(G') \subseteq \mathcal L(G) \).
Suppose \( w \in \mathcal L(G) \), then \( w \) has a shortest \( G \)-derivation.
Suppose this \( G \)-derivation of \( w \) contains a unit rule, so
\[ S \xrightarrow G \alpha A \beta \xrightarrow G_1 \alpha B \beta \xrightarrow G w \]
where this use of the unit rule is the last such usage.
Since \( w \) contains no variables, we must have applied a rule \( B \xrightarrow G_1 \zeta \).
\[ S \xrightarrow G \alpha A \beta \xrightarrow G_1 \alpha B \beta \xrightarrow G \gamma B \delta \xrightarrow G_1 \gamma \zeta \delta \xrightarrow G w \]
where the \( B \xrightarrow G_1 \zeta \) is the first usage of a rule for \( B \).
Since \( \alpha B \beta \xrightarrow \gamma B \delta \) did not use any \( B \)-rule by assumption, by unit closure we can replace this derivation with
\[ S \xrightarrow G \alpha A \beta \xrightarrow G \gamma A \delta \xrightarrow G_1 \gamma \zeta \delta \xrightarrow G w \]
This is clearly shorter.
So the shortest \( G \)-derivation contains no use of a unit rule, so is also a \( G' \)-derivation.

Finally, let us consider a rule \( A \to \alpha \) where \( \abs{\alpha} > 2 \) and \( \alpha \) contains only variables.
Suppose \( \alpha = A_1 \dots A_n \).
We define new variables \( X_1, \dots, X_{n-2} \), and add new rules \( A \to A_1 X_1, X_1 \to A_2 X_2, \dots X_{n-2} \to A_{n-1} A_n \).
Then, performing this for all such rules, we obtain a grammar without any such rules.
This grammar is in Chomsky normal form.
\begin{theorem}[Chomsky]
	Let \( G \) be a context-free grammar.
	Then we can compute an equivalent context-free grammar \( G' \) in Chomsky normal form.
\end{theorem}
\begin{proof}
	Remove problems due to rules of the form \( A \to \alpha \) where \( \alpha \) contains a letter and has length at least 2.
	Form the unit closure, then remove unit rules.
	Remove problems due to rules of the form \( A \to A_1 A_2 A_3 \dots A_n \).
\end{proof}

\subsection{The pumping lemma for context-free languages}
\begin{definition}
	Let \( L \) be a context-free language, and let \( n \in \mathbb N \).
	Suppose that for all \( w \in L \) such that \( \abs{w} \geq n \), there are \( x, y, z, u, v \) such that \( w = xuyvz \), and \( \abs{uyv} \leq n \), \( \abs{uv} > 0 \), and for all \( k \), \( xu^kyv^kz \in L \).
	Then \( L \) satisfies the \emph{pumping lemma for context-free languages with pumping number \( n \)}.
	We call \( u,v \) the \emph{pump}.
\end{definition}
\begin{remark}
	The pump now has two parts, and one part may be the empty string.
	There is no longer a constraint on the position of the pump in a word; \( x \) and \( z \) may be of any length.
	The regular pumping lemma implies the context-free pumping lemma.
	Since there are uncountably many languages satisfying the regular pumping lemma, there are also uncountably many languages satisfying the context-free pumping lemma.
	In particular, the context-free pumping lemma cannot characterise the countable class of all context-free languages.
\end{remark}
\begin{theorem}
	Every context-free language satisfies the context-free pumping lemma for some pumping number \( n \).
\end{theorem}
\begin{proof}
	Let \( L \) be a context-free language.
	Then \( L \) has a Chomsky normal form grammar \( G = (\Sigma, V, P, S) \), so \( \mathcal L(G) = L \).
	Let \( m = \abs{V} \), and \( n = 2^m \).
	We claim \( n \) is a pumping number for \( G \).

	If \( \mathbb T \) is a \( G \)-parse tree where the height of \( \mathbb T \) is \( h + 1 \) and \( \sigma_{\mathbb T} \) is a word, then \( \abs{\sigma_{\mathbb T}} \leq 2^h \).
	Indeed, the largest possible tree of height \( h + 1 \) has \( 2^{h + 1} \) leaves.
	Since \( \sigma_{\mathbb T} \) is a word, we must have applied a unary rule for each letter.
	Every unary rule reduces the amount of leaves by one.
	Thus, the tree must contain \( 2^{h+1} - \abs{\sigma_{\mathbb T}} \) leaves.
	Hence \( \abs{\sigma_{\mathbb T}} \leq 2^h \).

	Consider a word \( w \in \mathcal L(G) \) with \( \abs{w} \geq 2^m = n \).
	Then, if \( \mathbb T \) is a \( G \)-parse tree of \( w \), so \( \sigma_{\mathbb T} = w \), we know by the previous claim that the height of \( \mathbb T \) is at least \( m + 1 \).
	Let \( t \in T \) such that the length of the branch to \( t \) is the height \( h \geq m + 1 \) of the tree.
	Then, the path from \( \varepsilon \) to \( t \) has \( h + 1 \) labels, so contains \( h \) variables and one letter.

	Let \( s \) be an element of the branch of \( t \) such that the height of the subtree \( T_s \) is exactly \( m + 1 \).
	Hence \( \abs{\sigma_{\mathbb T_s}} \leq 2^m = n \).
	In particular, the path from \( s \) to \( t \) has \( m + 2 \) labels, so contains exactly \( m + 1 \) variables and one letter.
	By the pigeonhole principle, there are two nodes \( t_0 \subsetneq t_1 \) on the branch from \( s \) to \( t \) with the same label \( A \in V \).
	Let
	\[ \sigma_{\mathbb T} = a\sigma_{\mathbb T_s} b;\quad \sigma_{\mathbb T_s} = x' \sigma_{\mathbb T_{t_0}} z';\quad \sigma_{\mathbb T_{t_0}} = u \sigma_{\mathbb T_{t_1}} v;\quad \sigma_{\mathbb T_{t_1}} = y \]
	Then let \( x = ax';\quad z = z'b \) in the definition of the pumping lemma.
	Since \( t_0 \neq t_1 \), we have \( \abs{uv} > 0 \).
	Note that \( uyv = \sigma_{\mathbb T_{t_0}} \), which has length at most \( \abs{\sigma_{\mathbb T_s}} \), which has length at most \( 2^m = n \).

	Pumping down is accomplished by grafting \( T_{t_1} \) into \( T_{t_0} \); conversely, pumping up is accomplished by iteratively grafting \( T_{t_0} \) into \( T_{t_1} \).
	Define \( \mathbb T_{(0)} = \mathbb T_{t_1} \), and \( \mathbb T_{(i+1)} = \graft(\mathbb T_{t_0}, t_1, \mathbb T_{(i)}) \).
	Then \( \mathbb T_k = \graft(\mathbb T, t_1, \mathbb T_{(k)}) \), and \( \sigma_{\mathbb T_k} = xu^kyv^kz \), thus proving the pumping lemma.
\end{proof}
\begin{example}
	Consider the language \( L = \qty{a^n b^n c^n \mid n > 0} \) is not context-free.
	Suppose it is context-free.
	Then it has a pumping number \( N \in \mathbb N \).
	Consider the word \( a^N b^N c^N \in L \).
	Then \( a^N b^N c^N = xuyvz \) where \( \abs{uyv} \leq N \), \( \abs{uv} > 0 \).
	Since \( \abs{uyv} \leq N \), the string \( uyv \) cannot consist of all three letters \( a, b, c \).
	In any case, pumping up the string will increase the quantity of one letter, but not increase the quantity of some other letter.
	Then the new word does not lie in \( L \).
\end{example}

\subsection{Closure properties}
We have seen that \( L = \qty{a^nb^nc^n \mid n > 0} \) is not context-free.
However, \( L_0 = \qty{a^nb^nc^k \mid n, k > 0} \) and \( L_1 = \qty{a^kb^nc^n \mid n, k > 0} \) are context-free, as the concatenation of context-free languages.
But the intersection \( L_0 \cap L_1 \) is exactly \( L \), so context-free languages are not closed under intersection.
Therefore, they are also not closed under complement or difference, because this, along with closure under union, would imply closure under intersection.
Note that any model of computation corresponding to context-free grammars cannot have a product construction, because such a construction would imply closure of context-free languages under intersection.

It can be shown that context-free languages correspond to \emph{pushdown automata}, which are similar to deterministic automata, except that they also have a \emph{stack}, which is a way of storing a sequence of arbitrary symbols.
The stack has a \emph{push} operation allowing a symbol to be pushed onto the top of the stack, and a \emph{pop} operation that removes the topmost element currently on the stack.
In particular, the stack does not have random access, and any symbol pushed can be popped at most once.
Sequences that are pushed onto the stack are popped off in reverse order.

The transition function \( \delta \) has an additional input denoting the topmost element currently on the stack, and an additional output describing an operation to perform on the stack, if any.
\begin{theorem}
	A language is context-free if and only if there is a pushdown automaton for the language.
\end{theorem}

\subsection{Decision problems}
The word problem is already solved for context-free languages.
The emptiness problem can be solved by the pumping lemma, similarly to the solution for regular languages.
Indeed, if \( n \) is a pumping number, no word with length at most \( n \) can be the shortest word, since it can be pumped down.
So we can explicitly check each word of length less than \( n \) to solve the emptiness problem.
Note that the choice of pumping lemma to use does not matter in this argument.
